56 research outputs found

    Dissecting complex transcriptional responses using pathway-level scores based on prior information

    Background: The genomewide pattern of changes in mRNA expression measured using DNA microarrays is typically a complex superposition of the response of multiple regulatory pathways to changes in the environment of the cells. The use of prior information, either about the function of the protein encoded by each gene, or about the physical interactions between regulatory factors and the sequences controlling its expression, has emerged as a powerful approach for dissecting complex transcriptional responses. Results: We review two different approaches for combining the noisy expression levels of multiple individual genes into robust pathway-level differential expression scores. The first is based on a comparison between the distribution of expression levels of genes within a predefined gene set and those of all other genes in the genome. The second starts from an estimate of the strength of genomewide regulatory network connectivities based on sequence information or direct measurements of protein-DNA interactions, and uses regression analysis to estimate the activity of gene regulatory pathways. The statistical methods used are explained in detail. Conclusion: By avoiding the thresholding of individual genes, pathway-level analysis of differential expression based on prior information can be considerably more sensitive to subtle changes in gene expression than gene-level analysis. The methods are technically straightforward and yield results that are easily interpretable, both biologically and statistically.
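
The first approach in this abstract (comparing expression levels inside a gene set with those of all other genes) could be sketched roughly as below. This is only an illustration: the choice of a Welch-style t-statistic, the function name `gene_set_t_score`, and the gene identifiers are assumptions, not details taken from the paper.

```python
from math import sqrt
from statistics import mean, variance


def gene_set_t_score(log_ratios, gene_set):
    """Welch-style t-statistic comparing genes inside a set with all other genes.

    log_ratios: dict mapping gene id -> log expression ratio (hypothetical data).
    gene_set:   set of gene ids belonging to the pathway of interest.
    Both groups need at least two genes so that a variance can be computed.
    """
    inside = [v for g, v in log_ratios.items() if g in gene_set]
    outside = [v for g, v in log_ratios.items() if g not in gene_set]
    # Standard error of the difference in means, allowing unequal variances.
    se = sqrt(variance(inside) / len(inside) + variance(outside) / len(outside))
    return (mean(inside) - mean(outside)) / se


# Made-up example: three "pathway" genes shifted upward, three background genes.
example = {"g1": 1.0, "g2": 1.2, "g3": 0.8, "g4": 0.0, "g5": 0.1, "g6": -0.1}
score = gene_set_t_score(example, {"g1", "g2", "g3"})
```

No individual gene is thresholded here; every gene contributes to the score, which is the property the abstract credits for the added sensitivity.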

    RNAcontext: A New Method for Learning the Sequence and Structure Binding Preferences of RNA-Binding Proteins

    Metazoan genomes encode hundreds of RNA-binding proteins (RBPs). These proteins regulate post-transcriptional gene expression and have critical roles in numerous cellular processes including mRNA splicing, export, stability and translation. Despite their ubiquity and importance, the binding preferences for most RBPs are not well characterized. In vitro and in vivo studies, using affinity selection-based approaches, have successfully identified RNA sequence associated with specific RBPs; however, it is difficult to infer RBP sequence and structural preferences without specifically designed motif finding methods. In this study, we introduce a new motif-finding method, RNAcontext, designed to elucidate RBP-specific sequence and structural preferences with greater accuracy than existing approaches. We evaluated RNAcontext on recently published in vitro and in vivo RNA affinity selected data and demonstrate that RNAcontext identifies known binding preferences for several control proteins including HuR, PTB, and Vts1p and predicts new RNA structure preferences for SF2/ASF, RBM4, FUSIP1 and SLM2. The predicted preferences for SF2/ASF are consistent with its recently reported in vivo binding sites. RNAcontext is an accurate and efficient motif finding method ideally suited for using large-scale RNA-binding affinity datasets to determine the relative binding preferences of RBPs for a wide range of RNA sequences and structures

    Detecting microRNA binding and siRNA off-target effects from expression data.

    Sylamer is a method for detecting microRNA target and small interfering RNA off-target signals in 3' untranslated regions from a ranked gene list, sorted from upregulated to downregulated, after a microRNA perturbation or RNA interference experiment. The output is a landscape plot that tracks occurrence biases using hypergeometric P-values for all words across the gene ranking. We demonstrated the utility, speed and accuracy of this approach on several datasets.
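
The idea of tracking a word's occurrence bias along a ranked list with hypergeometric P-values can be illustrated with a simplified sketch. It is not the Sylamer implementation: it counts only whether a word is present in each sequence, tests only over-representation, and the function names and toy sequences are assumptions made for the example.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items without replacement from N, K of which are hits."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def word_enrichment_landscape(utrs_ranked, word):
    """For each leading cutoff n of the ranked list, return (n, P-value).

    The P-value is the chance of seeing at least as many word-containing
    sequences among the first n as were observed, given the list-wide count.
    """
    has_word = [word in utr for utr in utrs_ranked]
    N, K = len(has_word), sum(has_word)
    landscape, hits = [], 0
    for n in range(1, N + 1):
        hits += has_word[n - 1]
        landscape.append((n, hypergeom_upper_tail(hits, N, K, n)))
    return landscape


# Toy ranked list (most upregulated first); the sequences are invented.
toy_utrs = ["AAGCACTT", "GCACTTAA", "TTTTTTTT", "CCCCCCCC"]
landscape = word_enrichment_landscape(toy_utrs, "GCACTT")
```

Plotting the negative log of the P-values against the cutoff, one curve per word, would give a landscape of the kind the abstract describes.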

    Spatio-Temporal Dynamics of Yeast Mitochondrial Biogenesis: Transcriptional and Post-Transcriptional mRNA Oscillatory Modules

    Examples of metabolic rhythms have recently emerged from studies of budding yeast. High density microarray analyses have produced a remarkably detailed picture of cycling gene expression that could be clustered according to metabolic functions. We developed a model-based approach for the decomposition of expression to analyze these data and to identify functional modules which, expressed sequentially and periodically, contribute to the complex and intricate mitochondrial architecture. This approach revealed that mitochondrial spatio-temporal modules are expressed during periodic spikes and specific cellular localizations, which cover the entire oscillatory period. For instance, assembly factors (32 genes) and translation regulators (47 genes) are expressed earlier than the components of the amino-acid synthesis pathways (31 genes). In addition, we could correlate the expression modules identified with particular post-transcriptional properties. Thus, mRNAs of modules expressed “early” are mostly translated in the vicinity of mitochondria under the control of the Puf3p mRNA-binding protein. This last spatio-temporal module concerns mostly mRNAs coding for basic elements of mitochondrial construction: assembly and regulatory factors. Prediction that unknown genes from this module code for important elements of mitochondrial biogenesis is supported by experimental evidence. More generally, these observations underscore the importance of post-transcriptional processes in mitochondrial biogenesis, highlighting close connections between nuclear transcription and cytoplasmic site-specific translation

    PeakRegressor Identifies Composite Sequence Motifs Responsible for STAT1 Binding Sites and Their Potential rSNPs

    How to identify true transcription factor binding sites on the basis of sequence motif information (e.g., motif pattern, location, combination, etc.) is an important question in bioinformatics. We present “PeakRegressor,” a system that identifies binding motifs by combining DNA-sequence data and ChIP-Seq data. PeakRegressor uses L1-norm log linear regression in order to predict peak values from binding motif candidates. Our approach successfully predicts the peak values of STAT1 and RNA Polymerase II with correlation coefficients as high as 0.65 and 0.66, respectively. Using PeakRegressor, we could identify composite motifs for STAT1, as well as potential regulatory SNPs (rSNPs) involved in the regulation of transcription levels of neighboring genes. In addition, we show that among five regression methods, L1-norm log linear regression achieves the best performance with respect to binding motif identification, biological interpretability and computational efficiency
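
To show why an L1 penalty helps with motif selection, here is a minimal coordinate-descent sketch of L1-penalized least squares. It is a generic lasso solver, not PeakRegressor itself. Treating `y` as log-transformed peak values and the columns of `X` as scores for motif candidates is an assumption made for illustration, as are the function names and the toy data.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return b


# Toy data: the first "motif" explains the signal, the second barely matters.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
coef = lasso_coordinate_descent(X, y, lam=0.1)
```

The weak feature's coefficient is driven exactly to zero, which is the sparsity that makes an L1-penalized model easier to interpret as a short list of candidate motifs.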

    A Primer on Regression Methods for Decoding cis-Regulatory Logic

    The rapidly emerging field of systems biology is helping us to understand the molecular determinants of phenotype on a genomic scale [1]. Cis-regulatory elements are major sequence-based determinants of biological processes in cells and tissues [2]. For instance, during transcriptional regulation, transcription factors (TFs) bind to very specific regions on the promoter DNA [2,3] and recruit the basal transcriptional machinery, which ultimately initiates mRNA transcription (Figure 1A). Learning cis-Regulatory Elements from Omics Data: A vast amount of work over the past decade has shown that omics data can be used to learn cis-regulatory logic on a genome-wide scale [4-6], in particular by integrating sequence data with mRNA expression profiles. The most popular approach has been to identify over-represented motifs in promoters of genes that are coexpressed [4,7,8]. Though widely used, such an approach can be limiting for a variety of reasons. First, the combinatorial nature of gene regulation is difficult to explicitly model in this framework. Moreover, in many applications of this approach, expression data from multiple conditions are necessary to obtain reliable predictions. This can potentially limit the use of this method to only large data sets [9]. Although these methods can be adapted to analyze mRNA expression data from a pair of biological conditions, such comparisons are often confounded by the fact that primary and secondary response genes are clustered together, whereas only the primary response genes are expected to contain the functional motifs [10]. A set of approaches based on regression has been developed to overcome the above limitations [11-32]. These approaches have their foundations in certain biophysical aspects of gene regulation [26,33-35]. That is, the models are motivated by the expected transcriptional response of genes due to the binding of TFs to their promoters. While such methods have gathered popularity in the computational domain, they remain largely obscure to the broader biology community. The purpose of this tutorial is to bridge this gap. We will focus on transcriptional regulation to introduce the concepts. However, these techniques may be applied to other regulatory processes. We will consider only eukaryotes in this tutorial.

    The value of position-specific priors in motif discovery using MEME

    Background: Position-specific priors have been shown to be a flexible and elegant way to extend the power of Gibbs sampler-based motif discovery algorithms. Information of many types (including sequence conservation, nucleosome positioning, and negative examples) can be converted into a prior over the location of motif sites, which then guides the sequence motif discovery algorithm. This approach has been shown to confer many of the benefits of conservation-based and discriminative motif discovery approaches on Gibbs sampler-based motif discovery methods, but has not previously been studied with methods based on expectation maximization (EM). Results: We extend the popular EM-based MEME algorithm to utilize position-specific priors and demonstrate their effectiveness for discovering transcription factor (TF) motifs in yeast and mouse DNA sequences. Utilizing a discriminative, conservation-based prior dramatically improves MEME's ability to discover motifs in 156 yeast TF ChIP-chip datasets, more than doubling the number of datasets where it finds the correct motif. On these datasets, MEME using the prior has a higher success rate than eight other conservation-based motif discovery approaches. We also show that the same type of prior improves the accuracy of motifs discovered by MEME in mouse TF ChIP-seq data, and that the motifs tend to be of slightly higher quality than those found by a Gibbs sampling algorithm using the same prior. Conclusions: We conclude that using position-specific priors can substantially increase the power of EM-based motif discovery algorithms such as MEME.

    c-REDUCE: Incorporating sequence conservation to detect motifs that correlate with expression

    Background: Computational methods for characterizing novel transcription factor binding sites search for sequence patterns or "motifs" that appear repeatedly in genomic regions of interest. Correlation-based motif finding strategies are used to identify motifs that correlate with expression data and do not rely on promoter sequences from a pre-determined set of genes. Results: In this work, we describe a method for predicting motifs that combines the correlation-based strategy with phylogenetic footprinting, where motifs are identified by evaluating orthologous sequence regions from multiple species. Our method, c-REDUCE, can account for variability at a motif position inferred from evolutionary information. c-REDUCE has been tested on ChIP-chip data for yeast transcription factors and on gene expression data in Drosophila. Conclusion: Our results indicate that utilizing sequence conservation information in addition to correlation-based methods improves the identification of known motifs.

    Thermodynamic State Ensemble Models of cis-Regulation

    A major goal in computational biology is to develop models that accurately predict a gene's expression from its surrounding regulatory DNA. Here we present one class of such models, thermodynamic state ensemble models. We describe the biochemical derivation of the thermodynamic framework in simple terms, and lay out the mathematical components that comprise each model. These components include (1) the possible states of a promoter, where a state is defined as a particular arrangement of transcription factors bound to a DNA promoter, (2) the binding constants that describe the affinity of the protein–protein and protein–DNA interactions that occur in each state, and (3) whether each state is capable of transcribing. Using these components, we demonstrate how to compute a cis-regulatory function that encodes the probability of a promoter being active. Our intention is to provide enough detail so that readers with little background in thermodynamics can compose their own cis-regulatory functions. To facilitate this goal, we also describe a matrix form of the model that can be easily coded in any programming language. This formalism has great flexibility, which we show by illustrating how phenomena such as competition between transcription factors and cooperativity are readily incorporated into these models. Using this framework, we also demonstrate that Michaelis-like functions, another class of cis-regulatory models, are a subset of the thermodynamic framework with specific assumptions. By recasting Michaelis-like functions as thermodynamic functions, we emphasize the relationship between these models and delineate the specific circumstances representable by each approach. Application of thermodynamic state ensemble models is likely to be an important tool in unraveling the physical basis of combinatorial cis-regulation and in generating formalisms that accurately predict gene expression from DNA sequence
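
The three components listed in this abstract (promoter states, binding constants, and whether each state transcribes) can be combined into a small worked sketch. The one-activator promoter, the symbols `a`, `p`, and `omega`, and the function names below are illustrative assumptions and are not taken from the paper.

```python
def promoter_activity(states):
    """Probability that the promoter is active.

    states: list of (statistical_weight, is_active) pairs, one per promoter state.
    The result is the summed weight of the active states divided by the summed
    weight of all states (the partition function).
    """
    partition = sum(weight for weight, _ in states)
    active = sum(weight for weight, is_active in states if is_active)
    return active / partition


def one_activator_model(a, p, omega):
    """Hypothetical promoter with one activator site and one polymerase site.

    a:     binding constant times concentration for the activator
    p:     binding constant times concentration for the polymerase
    omega: cooperativity between bound activator and bound polymerase
    A state is assumed to transcribe only when the polymerase is bound.
    """
    states = [
        (1.0, False),           # empty promoter
        (a, False),             # activator only
        (p, True),              # polymerase only
        (a * p * omega, True),  # both bound, with cooperative interaction
    ]
    return promoter_activity(states)


basal = one_activator_model(a=0.0, p=1.0, omega=10.0)      # reduces to p / (1 + p)
activated = one_activator_model(a=1.0, p=1.0, omega=10.0)  # 11 / 13
```

With no activator the expression collapses to p / (1 + p), a saturating form of the kind the abstract calls Michaelis-like. Competition or further cooperativity could be explored by adding or reweighting entries in `states`.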

    Linking Proteomic and Transcriptional Data through the Interactome and Epigenome Reveals a Map of Oncogene-induced Signaling

    Cellular signal transduction generally involves cascades of post-translational protein modifications that rapidly catalyze changes in protein-DNA interactions and gene expression. High-throughput measurements are improving our ability to study each of these stages individually, but do not capture the connections between them. Here we present an approach for building a network of physical links among these data that can be used to prioritize targets for pharmacological intervention. Our method recovers the critical missing links between proteomic and transcriptional data by relating changes in chromatin accessibility to changes in expression and then uses these links to connect proteomic and transcriptome data. We applied our approach to integrate epigenomic, phosphoproteomic and transcriptome changes induced by the variant III mutation of the epidermal growth factor receptor (EGFRvIII) in a cell line model of glioblastoma multiforme (GBM). To test the relevance of the network, we used small molecules to target highly connected nodes implicated by the network model that were not detected by the experimental data in isolation and we found that a large fraction of these agents alter cell viability. Among these are two compounds, ICG-001, targeting CREB binding protein (CREBBP), and PKF118–310, targeting β-catenin (CTNNB1), which have not been tested previously for effectiveness against GBM. At the level of transcriptional regulation, we used chromatin immunoprecipitation sequencing (ChIP-Seq) to experimentally determine the genome-wide binding locations of p300, a transcriptional co-regulator highly connected in the network. Analysis of p300 target genes suggested its role in tumorigenesis. 
We propose that this general method, in which experimental measurements are used as constraints for building regulatory networks from the interactome while taking into account noise and missing data, should be applicable to a wide range of high-throughput datasets.
    Funding: National Science Foundation (U.S.) (DB1-0821391); National Institutes of Health (U.S.) (Grant U54-CA112967); National Institutes of Health (U.S.) (Grant R01-GM089903); National Institutes of Health (U.S.) (P30-ES002109)